Introduction to Open Data Science - course project

RStudio, GitHub & Open data sources

Textbook: Vehkalahti, Kimmo & Everitt, Brian S. (2019). Multivariate Analysis for the Behavioral Sciences, Second Edition. Chapman and Hall/CRC, Boca Raton, Florida, USA.

My GitHub repository: https://github.com/hkallo/IODS-project

My GitHub course diary: https://hkallo.github.io/IODS-project/

# Date of submission
date()
## [1] "Fri Dec  1 18:24:54 2023"

Assignment 1

How are you feeling right now? Feeling good and looking forward to learning new things in this course. R for Health Data Science is truly great; already after reading the first sections I have learnt several new things that will make my scripts better and more concise. The book has very nice explanations for everything.

What do you expect to learn? I already have some experience with the R statistical software and the basics of statistical analysis. I expect to learn about open data sources and how to utilize them, as well as about GitHub, which is a completely new thing to me. It is also nice to get a recap of the basics of statistics and perhaps find new ways to execute things in R.

Where did you hear about the course? I saw an email.

Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1: How did it work as a “crash course” on modern R tools and using RStudio? As I already have some knowledge of R, it was pretty easy and straightforward. However, if I had no experience with the language, I don’t think I would have learnt much, as the material is detached from actual script writing and nothing really had to be written by yourself. The book seems like a very nice tool for checking how to execute things in R; together with the assignments it works better, though!

Which were your favorite topics? Which topics were most difficult? My favorite parts were the pipe (%>%) and several parts of the Summarizing data chapter. These helped me to understand functions that I have already used, but apparently without understanding them perfectly :) It will be easier to use them in the future. I have also tried R Markdown before, but it will be nice to learn more about it. I did not know about the cheat sheets in R; that was a great tip!

Some other comments on the book and our new approach of getting started with R Markdown etc.? I guess during the upcoming weeks we get to do the coding ourselves, so no other comments :)


Assignment 2: Regression and model validation

Name: Henna Kallo

Date: 13.11.2023

In this exercise we learn to perform data wrangling and linear regression analysis!

# Assignment submitted
date()
## [1] "Fri Dec  1 18:24:54 2023"

Data wrangling: preparing dataset for analysis

library(tidyverse); library(GGally) #load libraries (tidyverse already includes dplyr and ggplot2)

data <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t",  header = TRUE) #data upload

str(data) #structure of the object
dim(data) #dimensions of the data

Next we will create a new dataset for analysis containing only the needed information: gender, age, attitude, the deep, strategic and surface level question scores, and points.

#save the variables ready in the original data frame to the new analysis data frame
learningAnalysis <- data %>%
  select(gender, Age, Attitude, Points)

#find combination variables (deep, strategic and surface level questions) and save each on their own variable
deep <- data %>%
  select(starts_with("D")) #NB! Excess amount of columns selected! With the next line we include only the deep question columns.
deep <- deep[,1:12]

surf <- data %>%
  select(starts_with("SU"))

stra <- data %>%
  select(starts_with("ST"))

#averaging the answers of selected questions and saving them to the analysis dataset
learningAnalysis$deep <- rowMeans(deep)
learningAnalysis$stra <- rowMeans(stra)
learningAnalysis$surf <- rowMeans(surf)

#scale the combination variable Attitude back to the 1-5 scale by dividing with 10, and delete the old variable
learningAnalysis$attitude <- learningAnalysis$Attitude / 10
learningAnalysis <- subset(learningAnalysis, select = -Attitude)

colnames(learningAnalysis) #check the column names and convert if needed
colnames(learningAnalysis)[2] <- "age"
colnames(learningAnalysis)[3] <- "points"
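As a side note, the renaming above can also be done in one step with dplyr’s rename(); a minimal sketch on a toy data frame (not the course data), assuming dplyr is available:

```r
# Sketch: rename columns with dplyr::rename() on a toy data frame
library(dplyr)

df <- data.frame(gender = c("F", "M"), Age = c(21L, 25L), Points = c(25L, 12L))
df <- rename(df, age = Age, points = Points)  # new_name = old_name
colnames(df)
```

rename() keeps all other columns untouched, so it is a bit safer than indexing colnames() by position.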

learningAnalysis <- filter(learningAnalysis, points>0) #include only the rows where points > 0

#reorder the columns:
learningAnalysis <- learningAnalysis %>%
  select(gender, age, attitude, deep, stra, surf, points)

#last check
str(learningAnalysis)
head(learningAnalysis)

Now we have the data frame prepared for the subsequent analysis. Let’s save the file to the IODS project folder.

setwd("C:/Users/Henna/Desktop/IODS/IODS-project") #set working directory to the IODS project folder
getwd() #check that it worked

write_csv(learningAnalysis, file= "learning2014.csv") #save file as csv
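A small aside: instead of hard-coding setwd() with an absolute path, the file location can be built with file.path(), which keeps the script portable between machines. A minimal sketch using tempdir() as a stand-in for the project folder and base R read/write functions:

```r
# Sketch: write and re-read a csv without setwd(); tempdir() stands in
# for the project folder here
project_dir <- tempdir()
toy <- data.frame(gender = c("F", "M"), points = c(25L, 12L))
write.csv(toy, file.path(project_dir, "toy.csv"), row.names = FALSE)
reloaded <- read.csv(file.path(project_dir, "toy.csv"))
dim(reloaded)  # 2 rows, 2 columns
```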

Data analysis: explore, analyze & interpret

Summary of the dataset

setwd("C:/Users/Henna/Desktop/IODS/IODS-project") #set working directory

learningAnalysis <- read.csv("learning2014.csv", header=TRUE) #read file into R

str(learningAnalysis) #structure of the data frame
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learningAnalysis) #first rows of the data frame
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
summary(learningAnalysis) #summary of the variables
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Data description:

  • Research question: Do students’ gender, age, attitude and question scores impact success in course completion (gained points)? Learning approaches are measured with several different levels of questions (deep, strategic, surface). A set of questions is used to quantify attitude on a scale of 1-5. Points represent the points that students have gained from the course. The data also includes information on the students’ age and gender.

  • Data frame structure: There are 166 observations of each variable. There are 7 variables in total: gender (character), age (integer), attitude (numeric), the deep, strategic and surface level question scores (numeric), and points (integer).

  • The summary table above describes summaries of variables: min, max, mean, median and 1st and 3rd quartiles.

    • There are 110 women and 56 men in this study.

    • The students are aged from 17 up to 55.

    • Attitude scores range between 1.4 and 5.0, the average being 3.14.

    • The mean scores of deep, strategic and surface questions are 3.68, 3.12 and 2.79, respectively.

    • The minimum points gained is 7, the maximum 33. The average is 22.72 points, and 50% of all scores fall in the range 19.0-27.75 points.

Linear regression analysis: Is there an association/dependency between points (dependent variable) and the age/attitude/deep/stra/surf (explanatory) variables?

To test this, we will perform regression analysis, a statistical method that describes the relationship between two or more variables.
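As a warm-up, here is a minimal sketch of what lm() estimates, on simulated toy data (not the course data): with y generated as 2 + 3x plus noise, the fitted coefficients land near the true values.

```r
# Sketch: simple linear regression recovers the generating coefficients
set.seed(1)
x <- 1:50
y <- 2 + 3 * x + rnorm(50)  # true intercept 2, true slope 3
fit <- lm(y ~ x)
coef(fit)                   # estimates close to 2 and 3
```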

Graphical overview of the data:

#Function for adding correlation panel
panel.cor <- function(x, y, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- round(cor(x, y), digits=2)
  txt <- paste0("R = ", r)
  cex.cor <- 0.8/strwidth(txt) #if want to adjust text size based on R value
  text(0.5, 0.5, txt, cex = 1)} #if adjusting; cex=cex.cor

#Function for adding regression line
upper_panel_regression_line = function(x,y, ...){
  points(x,y,...)
  linear_regression = lm(y~x)
  linear_regression_line = abline(linear_regression)}

my_cols <- c("#00AFBB", "#E7B800") #set colors

learningAnalysis$gender<-as.factor(learningAnalysis$gender) #for being able to change color, convert gender to factor type

#Visualization with a scatter plot + regression line matrix, add R values to the lower side of the matrix.
pairs(learningAnalysis[-1], col = my_cols[learningAnalysis$gender],
      lower.panel = panel.cor , upper.panel = upper_panel_regression_line)

The scatter plots show the distribution of the observations (female turquoise, male yellow). The regression lines give some indication of whether there is an association between two variables. For instance, there seems to be a positive dependency between attitude and points: the line goes upwards and is steeper than any other line. Also, the R value (correlation coefficient) for points vs. attitude is 0.44, suggesting a positive correlation between these variables. If the slope is (close to) horizontal and R is (close to) zero, there is no association between the variables; this is the case, for example, between the deep and points variables. Then again, if the R value is negative and the slope of the regression line is negative (the line goes downhill), as between the surf and points variables, the association is negative: when the surface question score gets a higher value, the points are more likely lower. Thus, with this kind of modelling we can also predict success in the course based on the answers to the preliminary questions. A better explanation of R values and their meaning comes after the next figure.
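The R values in the panels come from cor(); as a quick illustration on R’s built-in mtcars data (not the course data), a strongly negative pair looks like this:

```r
# Sketch: Pearson correlation coefficient with cor() (built-in mtcars data)
round(cor(mtcars$wt, mtcars$mpg), 2)  # clearly negative: heavier cars, lower mpg
```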

Let’s check whether running the analysis separately for men and women reveals differences in the associations between the variables.

# create a more advanced plot matrix with ggpairs()
ggpairs(learningAnalysis, mapping = aes(col = gender, alpha = 0.3), 
        lower = list(combo = wrap("facethist", bins = 20)))

Let’s go through the figure. Red = women, Blue = men.

There are about twice as many women as men in this study.

The boxplots reveal that the median age of women is a bit lower than that of men. The women’s median attitude score is also a bit lower, whereas their strategic and surface question scores are slightly higher.

The density plots reveal the same. They also show that the distributions of the attitude and surface question scores are quite different for men and women. The density plot of age reveals that the data is right skewed: there are more young participants than older ones. Age, stra and surf have a clearly unimodal distribution (only one peak), while the other variables have 1-2 peaks, sometimes depending on the gender. The scatter plots and histograms also show the distribution of values, but the clearest way to draw conclusions about the distributions is still the density plot.

As we are focused on studying the relationship between the dependent variable ‘points’ and the explanatory variables age, attitude, deep, stra and surf, let’s focus on the rightmost column of correlation coefficient values (the correlation coefficient measures the linear correlation between two sets of data). This article (Mukaka, 2012; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/) presents a rule of thumb for interpreting different coefficient values. The strongest association is between attitude and points; this association is strong in both genders separately as well. Interestingly, age seems to have a stronger association with points in men than in women. This association is negative, meaning that older age is associated with lower scores. The next biggest impact seems to come from the stra and surf variables, but their coefficients are already quite low and not statistically significant.

Multiple regression model

Now we select 3 explanatory variables to explain the dependent variable “points”. This selection is based on the slopes of the regression lines and the R values in the figures above. The 3 highest absolute R values are selected: attitude (R=0.44), stra (R=0.15) and surf (R=-0.14) (genders not separated in the analysis).

#Multiple regression analysis
# create a regression model with multiple explanatory variables
my_model<- lm(points ~ attitude + stra + surf, data = learningAnalysis)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learningAnalysis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The p-value of the F-statistic is 3.156e-08, which is highly significant. This means that at least one of the predictor variables (attitude, stra, surf) is significantly related to the outcome variable.

To identify which predictor variables are significant, let’s examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values. Attitude is significantly associated with points, whereas the variables stra and surf show no association. Thus, we can remove these two variables from the model. The coefficient b for attitude is ~3.40, meaning that this is the average effect on our dependent variable (points) when the predictor (attitude) increases by one unit and all the other predictors stay unchanged.

my_model_attitude<- lm(points ~ attitude, data = learningAnalysis)

# print out a summary of the model
summary(my_model_attitude)
## 
## Call:
## lm(formula = points ~ attitude, data = learningAnalysis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The p-value of the F-statistic is significant (4.119e-09), and the model tells us that when attitude grows by 1 unit, points increase by about 3.5 on average. The final equation would be: points = 11.6 + 3.5 * attitude.
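A formal way to check that dropping predictors does not worsen the fit is a nested-model F-test with anova(); here a sketch on R’s built-in mtcars data, since the idea is the same for any pair of nested lm() models:

```r
# Sketch: compare nested linear models with an F-test (built-in mtcars data)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt, data = mtcars)
anova(reduced, full)  # a small p-value would favor keeping the extra terms
```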

Out of curiosity, before proceeding to quality assessment, let’s check whether age should be included in the model when analyzing the explanatory variables of points in men only.

menstudents <- learningAnalysis %>%
  filter(gender=="M")

my_model_men<- lm(points ~ attitude + age, data = menstudents)

summary(my_model_men) #summary of the model
## 
## Call:
## lm(formula = points ~ attitude + age, data = menstudents)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6798  -3.2162   0.5482   2.8711   9.3838 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.66207    4.56036   2.996 0.004156 ** 
## attitude     4.10182    1.10353   3.717 0.000487 ***
## age         -0.16050    0.08465  -1.896 0.063418 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.285 on 53 degrees of freedom
## Multiple R-squared:  0.2538, Adjusted R-squared:  0.2256 
## F-statistic: 9.013 on 2 and 53 DF,  p-value: 0.0004272

Interestingly, this analysis indicates that age might be negatively associated with points in men: the older the man, the fewer the points. The p-value 0.06 is close to statistical significance (p<0.05). However, let’s not continue further with this subset but rather analyse men and women together.

Quality assessment

Finally, we should perform a quality assessment of the model, based on R-squared (R2) and the Residual Standard Error (RSE). R2 is sensitive to the number of variables included in the model, so the adjusted R2 corrects for the number of explanatory variables in the prediction model. The adjusted R2 = 0.1856, meaning that ~19% of the variance in points can be predicted by the attitude score. If we compare this adjusted R2 value to the previous model with 3 predictor variables, the values are not very different, meaning that having 3 predictors in the model did not improve its quality.

#error rate
summary(my_model_attitude)$sigma/mean(learningAnalysis$points)
## [1] 0.2341752

The RSE estimate gives a measure of the prediction error: the lower the RSE, the more accurate the model. Here we calculated the error rate by dividing the RSE by the mean of the outcome variable; thus an RSE of 5.32 corresponds to a 23% error rate.

The last thing to do is to graphically explore the validity of our model assumptions with the Residuals vs Fitted values plot, the Normal QQ-plot and the Residuals vs Leverage plot. A residual is the difference between the observed value and the value that the model predicts for that observation.

# draw diagnostic plots using the plot() function
par(mfrow = c(2,2))
plot(my_model_attitude, which=c(1,2,5))
par(mfrow = c(1, 1)) #reset plotting matrix: 

  • Residuals vs Fitted values plot: detects unequal error variances, non-linearity, and outliers.
    • On the right end of the x axis there is some indication of heteroskedasticity: the spread of the residuals seems to decrease. However, the variance of the residuals otherwise seems fairly even, so I conclude that this amount of variation is acceptable.
    • The horizontal band (red line) is formed around the zero line. Thus, it is reasonable to assume a linear relationship.
    • There are 3 outliers. One option would be to further evaluate whether (some of) these three observations should be removed from the analysis.
  • Normal QQ-plot: provides an indication of the univariate normality of the dataset.
    • From the figure we observe that the normal probability plot of the residuals is approximately linear, supporting the condition that the error terms are normally distributed.
  • Residuals vs Leverage plot: identifies influential data points in the model.
    • There are three data points highlighted; one of them was already flagged as an outlier in the two previous plots. It might be a good idea to further study the influence of these observations on the slope of the regression line (https://rpubs.com/Amrabdelhamed611/669768). However, as the points are not flagged by Cook’s distance, they are most likely not too influential and can thus be included in the analysis.

Conclusion: The final course points of the students are positively associated with the attitude scores based on the preliminary questions asked before the course: the higher the attitude score, the more points. Only one explanatory variable is included in the model, as the rest did not reach significance. Quality assessment revealed that our model is reliable.


Assignment 3: Logistic regression

Name: Henna Kallo

Date: 18.11.2023

In this exercise we learn to perform data wrangling and logistic regression analysis!

We are exploring data from two questionnaires related to student performance in secondary education at two Portuguese schools. The data includes student grades and demographic, social and school-related variables. Here we have combined the datasets of the Mathematics and Portuguese language subjects.

Here we study the relationships between alcohol consumption with selected variables.

Data source: http://www.archive.ics.uci.edu/dataset/320/student+performance

#import data
library(readr)
alc<-read_csv("C:/Users/Henna/Desktop/IODS/IODS-project/Data/student_performance_alcohol.csv") 
alc<-as.data.frame(alc);

colnames(alc) #column names
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
str(alc) #structure of the dataset
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : num  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : num  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: num  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : num  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : num  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : num  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : num  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : num  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

All the variables are listed above. The variables not used for joining the two datasets have been combined by averaging (including the grade variables). Alcohol use (‘alc_use’) is the average of workday and weekend alcohol consumption; if the average is higher than 2, alcohol consumption is considered ‘high use’ (high_use). The rest of the variables are described on the website (link given above).

We have 370 observations and 35 variables in the data frame. There are character, numeric and logical types of variables.

Next we select 4 interesting variables to further study their relationship with alcohol consumption. Let’s visualize the variables:

library(tidyr); library(ggplot2); library(dplyr);
gather(alc) %>% ggplot(aes(value))  + 
  facet_wrap("key", scales = "free") + 
  geom_bar(fill="#00AFBB") +
  ggtitle("Barplots of all variables")

Below are the 4 chosen factors, a brief description of each variable, and the expected relationship with alcohol consumption:

  • Goout
    • going out with friends, scoring 1-5
    • hypothesis: there is a positive relationship between ‘goout’ and ‘high_use’: higher value of the goout is linked with heavier alcohol consumption.
  • Absences
    • number of school absences
    • hypothesis: there is a positive relationship between ‘absences’ and ‘high_use’: higher value of absences is linked with heavier alcohol consumption.
  • Failures
    • number of past class failures
    • hypothesis: there is a positive relationship between ‘failures’ and ‘high_use’: higher value of failures is linked with heavier alcohol consumption.
  • Romantic
    • with a romantic relationship (binary: yes or no)
    • hypothesis: there might be an association between variables ‘romantic’ and ‘high_use’: single status linked with high_use=TRUE.

Next we will see whether we can find support for our hypotheses with numerical and graphical exploration of data:

We use density or frequency plots (depending on the variable), as well as cross-tabulations and stacked bar plots, to explore each variable.

ggplot(alc, aes(x=goout)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("GoOut distribution") +
  theme_classic()

# create cross-tabulations and plots to visualize
library(sjPlot)

tab_xtab(var.row = alc$goout, var.col = alc$high_use, title = "Cross-Tab of GoOut & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of GoOut & High alcohol consumption

goout   high_use=FALSE   high_use=TRUE   Total
1       19 (86.4 %)      3 (13.6 %)      22
2       82 (84.5 %)      15 (15.5 %)     97
3       97 (80.8 %)      23 (19.2 %)     120
4       40 (51.3 %)      38 (48.7 %)     78
5       21 (39.6 %)      32 (60.4 %)     53
Total   259 (70 %)       111 (30 %)      370

χ2=55.574 · df=4 · Cramer’s V=0.388 · p=0.000
plot_xtab(alc$goout, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)

The density plot shows that most of the students get a ‘goout’ score of 2-4. The cross-tabulation shows that the bigger the ‘goout’ score, the bigger the proportion of observations with high_use=TRUE. This supports our hypothesis: a higher goout value is linked with heavier alcohol consumption.

ggplot(alc, aes(x=absences)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Absences distribution") +
  theme_classic()

tab_xtab(var.row = alc$absences, var.col = alc$high_use, title = "Cross-Tab of Absences & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of Absences & High alcohol consumption

absences   high_use=FALSE   high_use=TRUE   Total
0          50 (79.4 %)      13 (20.6 %)     63
1          37 (74 %)        13 (26 %)       50
2          40 (71.4 %)      16 (28.6 %)     56
3          31 (81.6 %)      7 (18.4 %)      38
4          24 (68.6 %)      11 (31.4 %)     35
5          16 (72.7 %)      6 (27.3 %)      22
6          16 (76.2 %)      5 (23.8 %)      21
7          9 (75 %)         3 (25 %)        12
8          14 (70 %)        6 (30 %)        20
9          5 (45.5 %)       6 (54.5 %)      11
10         5 (71.4 %)       2 (28.6 %)      7
11         2 (40 %)         3 (60 %)        5
12         3 (42.9 %)       4 (57.1 %)      7
13         1 (50 %)         1 (50 %)        2
14         1 (14.3 %)       6 (85.7 %)      7
16         0 (0 %)          1 (100 %)       1
17         0 (0 %)          1 (100 %)       1
18         1 (50 %)         1 (50 %)        2
19         0 (0 %)          1 (100 %)       1
20         2 (100 %)        0 (0 %)         2
21         1 (50 %)         1 (50 %)        2
26         0 (0 %)          1 (100 %)       1
27         0 (0 %)          1 (100 %)       1
29         0 (0 %)          1 (100 %)       1
44         0 (0 %)          1 (100 %)       1
45         1 (100 %)        0 (0 %)         1
Total      259 (70 %)       111 (30 %)      370

χ2=43.001 · df=25 · Cramer’s V=0.341 · Fisher’s p=0.004
plot_xtab(alc$absences, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)

The density plot shows that most of the students have an absence score of zero (the distribution is right skewed). The cross-tabulation shows that the more absences there are, the bigger the proportion of students with high alcohol consumption tends to be. This supports our hypothesis: a higher number of absences is linked with heavier alcohol consumption.

ggplot(alc, aes(x=failures)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Failures distribution") +
  theme_classic()

tab_xtab(var.row = alc$failures, var.col = alc$high_use, title = "Cross-Tab of failures & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of failures & High alcohol consumption

failures   high_use=FALSE   high_use=TRUE   Total
0          238 (73.2 %)     87 (26.8 %)     325
1          12 (50 %)        12 (50 %)       24
2          8 (47.1 %)       9 (52.9 %)      17
3          1 (25 %)         3 (75 %)        4
Total      259 (70 %)       111 (30 %)      370

χ2=14.304 · df=3 · Cramer’s V=0.197 · Fisher’s p=0.002
plot_xtab(alc$failures, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)

The density plot of failures is also right skewed, meaning that most of the students pass their classes. The cross-tabulation shows that a higher number of failures is linked with a bigger proportion of high alcohol consumption. This supports our hypothesis: a higher number of failures is linked with heavier alcohol consumption.

ggplot(alc, aes(x=romantic)) +
  geom_bar(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Romantic relationship status frequencies") +
  theme_classic()

tab_xtab(var.row = alc$romantic, var.col = alc$high_use, title = "Cross-Tab of Romantic relationship status & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of Romantic relationship status & High alcohol consumption

romantic   high_use=FALSE   high_use=TRUE   Total
no         173 (68.9 %)     78 (31.1 %)     251
yes        86 (72.3 %)      33 (27.7 %)     119
Total      259 (70 %)       111 (30 %)      370

χ2=0.286 · df=1 · φ=0.034 · p=0.593
plot_xtab(alc$romantic, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)

The frequency table shows that about 1/3 of the students are in a romantic relationship, whereas ~2/3 are single. The cross-tabulation indicates that even though there is a slightly bigger percentage of high alcohol use among single students, this difference is most likely not statistically meaningful. Thus, this result does not support our hypothesis that there might be an association between the variables ‘romantic’ and ‘high_use’.

We are interested in whether alcohol consumption is associated with the chosen set of student characteristics. Binary logistic regression can tell us the probability of this. We choose binary logistic regression because the outcome variable, ‘high_use’, has two levels (TRUE/FALSE); the explanatory variables can be of other types as well.
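The call itself mirrors lm(), with family = "binomial" switching on the logit link; here is a minimal sketch on R’s built-in mtcars data (transmission type am as a binary outcome), not the course data:

```r
# Sketch: binary logistic regression with glm() (built-in mtcars data)
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
coef(fit)  # negative wt coefficient: heavier cars are less likely manual
```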

#binary logistic regression
m <- glm(high_use ~ failures + absences + goout + romantic, data = alc, family = "binomial")

summary(m) # a summary of the model
## 
## Call:
## glm(formula = high_use ~ failures + absences + goout + romantic, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0472  -0.7510  -0.5266   0.8886   2.3474  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.55115    0.44211  -8.032 9.57e-16 ***
## failures     0.56492    0.22385   2.524 0.011612 *  
## absences     0.07762    0.02243   3.461 0.000538 ***
## goout        0.70634    0.11846   5.963 2.48e-09 ***
## romanticyes -0.35224    0.27737  -1.270 0.204111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 382.41  on 365  degrees of freedom
## AIC: 392.41
## 
## Number of Fisher Scoring iterations: 4

Let’s go through the output:

First we have the distribution of the deviance residuals.

Next we get the coefficients, their standard errors, the z-statistics, and the associated p-values. Failures, absences and goout are statistically significant; the variable ‘romantic’ is not.

The logistic regression coefficients give the change in the log odds of the outcome (high_use) for a one unit increase in the predictor variable:

  • for every one unit change in failures, the log odds of high_use=TRUE increases by 0.56.
  • for every one unit change in absences, the log odds of high_use=TRUE increases by 0.08.
  • for every one unit change in goout, the log odds of high_use=TRUE increases by 0.71.

From the results we can construct the logistic regression equation (leaving out the statistically non-significant variable):

log(odds[high_use=yes]) = -3.55115 + 0.56492 * failures + 0.07762 * absences + 0.70634 * goout
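The fitted log-odds can be converted back to a probability with the inverse logit function. As a small sketch (the student values below are made up for illustration):

```r
# Inverse-logit sketch: predicted probability of high_use for a
# hypothetical student with 1 failure, 4 absences and goout = 3,
# using the coefficients estimated above.
logit <- -3.55115 + 0.56492 * 1 + 0.07762 * 4 + 0.70634 * 3
prob <- plogis(logit)  # same as exp(logit) / (1 + exp(logit))
round(prob, 3)  # ~0.364
```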

Next we will compute the odds ratios (OR) and confidence intervals (CI):

coef(m) # the coefficients of the model
## (Intercept)    failures    absences       goout romanticyes 
## -3.55114961  0.56492346  0.07762078  0.70634191 -0.35223716
OR <- coef(m) %>% exp # compute odds ratios (OR)

CI <-  confint(m) %>% exp # compute confidence intervals (CI)

cbind(OR, CI) # print out the odds ratios with their confidence intervals
##                     OR      2.5 %     97.5 %
## (Intercept) 0.02869164 0.01165358 0.06618211
## failures    1.75931312 1.14014676 2.75402869
## absences    1.08071276 1.03551949 1.13208308
## goout       2.02656433 1.61539789 2.57288191
## romanticyes 0.70311335 0.40369796 1.20112415

We can conclude the following:

  • for a one-unit increase in failures, the odds of high alcohol consumption increase by a factor of 1.76 (the outcome is 76% more likely).
  • for a one-unit increase in absences, the odds of high alcohol consumption increase by a factor of 1.08 (the outcome is 8% more likely).
  • for a one-unit increase in goout, the odds of high alcohol consumption increase by a factor of 2.03 (the odds of the outcome roughly double).

CI is used to estimate the precision of the OR. A large CI indicates a low level of precision of the OR, whereas a small CI indicates a higher precision of the OR.
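A related rule of thumb: an OR differs from 1 at roughly the 5% significance level when its 95% CI does not include 1. As a sketch, using the rounded values from the printout above:

```r
# Sketch: flag ORs whose 95% CI excludes 1 (values copied from the
# printout above, rounded to three decimals).
OR_CI <- data.frame(
  OR    = c(1.759, 1.081, 2.027, 0.703),
  lower = c(1.140, 1.036, 1.615, 0.404),
  upper = c(2.754, 1.132, 2.573, 1.201),
  row.names = c("failures", "absences", "goout", "romanticyes")
)
OR_CI$excludes_1 <- OR_CI$lower > 1 | OR_CI$upper < 1
OR_CI  # romanticyes is the only variable whose CI spans 1
```

This agrees with the p-values in the model summary: the CI of ‘romanticyes’ spans 1, matching its non-significant coefficient.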

These results support our hypotheses about the effects of failures, absences and goout, but not about the effect of romantic relationship status.

Next, we will explore the predictive power of the model. We will include only the statistically significant variables: failures, absences & goout.

m_pred <- glm(high_use ~ failures + absences + goout, data = alc, family = "binomial")

probabilities <- predict(m_pred, type = "response") # predict the probability of high_use

alc <- mutate(alc, probability = probabilities) # add the predicted probabilities to 'alc'

alc <- mutate(alc, prediction = probability > 0.5) # use the probabilities to make a prediction of high_use

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, goout, high_use, probability, prediction) %>% tail(10)
##     failures absences goout high_use probability prediction
## 361        0        3     3    FALSE  0.21435314      FALSE
## 362        1        0     2    FALSE  0.15791281      FALSE
## 363        1        7     3     TRUE  0.38937775      FALSE
## 364        0        1     3    FALSE  0.19036196      FALSE
## 365        0        6     3    FALSE  0.25431749      FALSE
## 366        1        2     2    FALSE  0.17871716      FALSE
## 367        0        2     4    FALSE  0.33847888      FALSE
## 368        0        3     1    FALSE  0.06266341      FALSE
## 369        0        4     5     TRUE  0.54534711       TRUE
## 370        0        2     1     TRUE  0.05843362      FALSE
table(high_use = alc$high_use, prediction = alc$prediction) # tabulate the target variable versus the predictions
##         prediction
## high_use FALSE TRUE
##    FALSE   237   22
##    TRUE     66   45
ggplot(alc, aes(x = probability, y = high_use, col = prediction))+ #plot of 'high_use' versus 'probability' in 'alc'
  geom_point()+
  ggtitle("Plotted confusion matrix results ")

The printout of the data frame shows the new columns: the predicted probabilities and the predictions of high_use.

A confusion matrix visualizes and summarizes the performance of a classification algorithm:

  • true negatives: 237
  • true positives: 45
  • false positives (type I error): 22
  • false negatives (type II error): 66

Next, let’s compute the proportion of incorrect predictions. The mean of incorrectly classified observations can be thought of as a penalty (loss) function for the classifier: the smaller the penalty, the better.

table(high_use = alc$high_use, prediction = alc$prediction) %>% # tabulate the target variable versus the predictions
  prop.table() %>%
  addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64054054 0.05945946 0.70000000
##    TRUE  0.17837838 0.12162162 0.30000000
##    Sum   0.81891892 0.18108108 1.00000000
loss_func <- function(class, prob) { # define a loss function (mean prediction error)
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability) #compute the average number of wrong predictions in the (training) data
## [1] 0.2378378

This analysis revealed that the proportion of wrong predictions is ~24%.
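To put that error in context, we can compare it with a trivial classifier that always guesses the majority class (FALSE). A sketch using the class counts from the confusion matrix above (259 FALSE, 111 TRUE):

```r
# Baseline sketch: error of always guessing FALSE, using the class
# counts from the confusion matrix above (259 FALSE, 111 TRUE).
high_use <- rep(c(FALSE, TRUE), times = c(259, 111))
baseline_error <- mean(high_use != FALSE)  # every TRUE case is missed
round(baseline_error, 2)  # 0.3
```

The model’s error (~0.24) is smaller than this 0.30 baseline, so the predictors do add information.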

Now we continue to cross-validation of the model.

# K-fold cross-validation
library(boot)
# leave-one-out cross-validation (K = number of observations)
cv_train <- cv.glm(data = alc, cost = loss_func, glmfit = m_pred, K = nrow(alc))

# 10-fold cross-validation
cv_test <- cv.glm(data = alc, cost = loss_func, glmfit = m_pred, K = 10)

# average proportion of wrong predictions
cv_train$delta[1]
## [1] 0.2405405
cv_test$delta[1]
## [1] 0.2432432

Comparing the average proportions of wrong predictions from the leave-one-out and 10-fold cross-validations, we see that the error is about the same in both. The error is slightly smaller than in the exercise set, meaning that a model with failures, absences and goout predicts high alcohol use a bit better than the model used in the exercise set.


Assignment 4: Clustering & Classification

Name: Henna Kallo

Date: 26.11.2023

In this exercise we learn data clustering and classification.

  • Data source: MASS package in R: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

  • Literature: Part IV of “MABS4IODS” (chapters 12, 17 & 18)

  • Important concepts:

    • Classification: organizing a large, complex set of multivariate data. Objects are sorted into a small number of homogeneous groups or clusters.
    • Multivariate data: several measurements or observations are made on each sampling unit, and they are considered simultaneously to reveal patterns. There is no division into explanatory and dependent variables.
library(MASS); library(tidyr); library(corrplot); library(dplyr); library(ggplot2) #load libraries
rm(list=ls())
data("Boston") #load the data

Let’s take a look of the Boston data:

str(Boston) #structure of the dataset
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
  • The Boston dataset contains multivariate data on housing values in suburbs of Boston. The data frame has 506 rows and 14 variables, all of them numeric. Brief description of the variables:
    • crim: crime rates
    • zn: proportion of residential land zoned for lots over 25,000 sq. ft.
    • indus: proportion of other than retail business acres
    • chas: next to Charles River, or not
    • nox: pollution; nitrogen oxides
    • rm: number of rooms per apartment
    • age: proportion of owner-occupied units built before 1940
    • dis: distances to the employment centres (weighted mean)
    • rad: access to radial highways
    • tax: full-value property-tax rate
    • ptratio: student/teacher ratio
    • black: 1000(Bk−0.63)^2 where Bk is the proportion of blacks by town.
    • lstat: lower status of the population (%)
    • medv: median value of owner-occupied homes in $1000s

With this data set we can explore connections between economic, environmental, societal and cultural features.

The descriptions of the variables are listed in here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html.

Let’s explore the summary statistics of the multivariate Boston data:

– each of the variables separately

– relationships between the variables (correlations)

summary(Boston) #summary of each variable separately
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
  • Some interesting points:
    • The mean per capita crime rate is 3.61. Variation is very large: the minimum rate is almost 0, whereas the maximum is close to 90.
    • Nitrogen oxide concentrations vary from 0.39 to 0.87 parts per 10 million.
    • The average number of rooms per dwelling is about 6, but it varies from 3.5 to almost 9.
    • On average, almost 70% of owner-occupied units were built before 1940. However, there is a lot of variability.
    • The full-value property-tax rate varies from 187 to 711 (per $10,000).
    • There are on average about 19 students per teacher.
    • The black index varies a lot but is relatively high in most of the towns.
    • The share of lower-status population varies between 1.73% and 37.97%; the average is about 13%.
    • The mean of the median values of owner-occupied homes is $22,530.

The scatterplots of each variable pair can be used as a first, indicative visualization of the associations between the variables. In addition, we print a correlation matrix visualization, which shows how strong the association between each variable pair is, and whether it is positive or negative.

pairs(Boston,
      col = "cornflowerblue",
      main = "Scatterplots for each variable-combination of Boston data frame")

cor_matrix <- cor(Boston) %>% #correlation matrix
  round(digits=2)

corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6) ##visualize the correlations

The crime rate seems to be slightly positively associated with the proportion of owner-occupied units built prior to 1940 and with lower status of the population. Also, as accessibility to radial highways improves and the property-tax rate increases, the crime rate per capita increases.

Not that surprisingly, there is an association between industry and nitrogen oxide levels. Moreover, a higher nitrogen oxide concentration seems to be associated with a higher proportion of owner-occupied units built prior to 1940 and with lower population status. The scatterplots indicate that the air pollution and industry variables are in turn negatively associated with the distance to the Boston employment centres and with the median value of owner-occupied homes.

A lower number of rooms per dwelling seems to be linked with lower population status. As expected, the average number of rooms is positively associated with the median value of owner-occupied homes. Lower status of the population is clearly linked with a reduced median value of owner-occupied homes.

Moreover, there is a negative association between the proportion of owner-occupied units built prior to 1940 and the distance to the employment centres, and a positive correlation between accessibility to radial highways and the full-value property-tax rate per $10,000.

The variables are on very different scales. We will make them more comparable by standardizing the dataset.

boston_scaled <- scale(Boston) # center and standardize variables

summary(boston_scaled) # summaries of the scaled variables
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

When scaling the data, we subtract the column mean from each column and divide the difference by the column’s standard deviation. This is why the mean of every variable is 0. After scaling (centering and standardizing) we can better compare the variables with each other. This website briefly discusses the why and when of centering and scaling variables: https://www.goldsteinepi.com/blog/thewhyandwhenofcenteringcontinuouspredictorsinregressionmodeling/index.html
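For a single column, the same standardization can be written out by hand, which makes the formula explicit; a sketch on the crim variable:

```r
# Sketch: scale() is equivalent to (x - mean(x)) / sd(x), column by column.
library(MASS)
data("Boston")
x <- Boston$crim
crim_manual <- (x - mean(x)) / sd(x)
crim_scaled <- as.numeric(scale(Boston)[, "crim"])
all.equal(crim_manual, crim_scaled)  # TRUE
```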

Classification

Linear discriminant analysis

“A further question that is often of interest for grouped multivariate data is whether or not it is possible to use the measurements made to construct a classification rule derived from the original observations (the training set) that will allow new individuals having the same set of measurements (the test sample).” -Part IV of “MABS4IODS” -> discriminant function analysis (chapter 18)

Now we will perform linear discriminant analysis: the idea is to find a linear combination of features that characterizes or separates two or more classes of objects or events. When we want to use a statistical method to predict something, it is important to have data to test how well the predictions fit. Splitting the original data into test and train sets allows us to check how well our model works.

Our target variable is the crime rate per capita by town; the rest of the variables are predictors. Our interest lies in deriving a classification rule that uses measurements of the predictor variables to place an observation into one of the categories of the crim variable.

We start by converting the crim variable to a categorical one, dividing the data into four categories: low, med_low, med_high and high crime rate per capita.

class(boston_scaled) # class of the boston_scaled object
## [1] "matrix" "array"
boston_scaled<-as.data.frame(boston_scaled) # change the object to data frame

#Create a factor variable from numerical: binning by quantiles; variable `crim` (per capita crime rate by town)

summary(boston_scaled$crim) #summary
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins<-quantile(boston_scaled$crim) #save quantiles, bin limits, in 'bins'
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Next we’ll divide the data into train (80% of the data) and test sets.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data; save for later to calculate how well the model performed in prediction
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

And finally perform the analysis and plot the results:

set.seed(123)
lda.fit <- lda(crime~., data = train)

lda.fit # print the lda.fit object
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2623762 0.2524752 0.2376238 
## 
## Group means:
##                  zn      indus        chas        nox         rm        age
## low       1.0670560 -0.9205618 -0.11484506 -0.9079839  0.4423969 -0.9538539
## med_low  -0.1130765 -0.2441408 -0.01233188 -0.5414604 -0.1475648 -0.3472909
## med_high -0.3762641  0.1564954  0.22945822  0.3209185  0.1418128  0.4113147
## high     -0.4872402  1.0172418 -0.06727176  1.0680511 -0.4586557  0.7833759
##                 dis        rad        tax     ptratio       black         lstat
## low       0.9467169 -0.6775274 -0.7199432 -0.44459945  0.37888625 -0.7702864949
## med_low   0.2961775 -0.5506543 -0.4464689 -0.05528293  0.31532521 -0.1276760444
## med_high -0.3406808 -0.4132674 -0.3090720 -0.27200081  0.09400468 -0.0003191301
## high     -0.8354530  1.6368728  1.5131579  0.77931510 -0.82222326  0.8725008995
##                  medv
## low       0.534754595
## med_low  -0.005926261
## med_high  0.190441185
## high     -0.718239451
## 
## Coefficients of linear discriminants:
##                   LD1          LD2           LD3
## zn       8.865764e-02  0.769129355 -0.8441691517
## indus   -8.773471e-03 -0.166808275  0.2151046484
## chas    -9.060755e-02 -0.075092458  0.0472380273
## nox      4.674199e-01 -0.578236892 -1.4164048073
## rm      -1.080262e-01 -0.080887276 -0.1824442143
## age      1.935988e-01 -0.344770321 -0.3097046168
## dis     -4.492858e-02 -0.165113766 -0.0935157191
## rad      3.144849e+00  0.991686927  0.0007720325
## tax      6.750153e-05 -0.127063632  0.5903339453
## ptratio  9.937571e-02  0.007799096 -0.2798178374
## black   -1.442695e-01  0.026234323  0.1668073665
## lstat    2.020085e-01 -0.187879456  0.2899838916
## medv     1.730222e-01 -0.355180840 -0.2962872788
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9437 0.0426 0.0137
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
                   x1 = myscale * heads[,choices[1]], 
                   y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch = classes)
lda.arrows(lda.fit, myscale = 1.2)

  • The biplot:
    • Variables are shown as arrows from the origin and observations as points. The configuration of the arrows reflects the relations between the variables: the cosine of the angle between two arrows reflects the correlation between the variables they represent, and the cosine of the angle between an arrow and an axis indicates how strongly the variable contributes to that linear discriminant.
    • The most influential linear separators for the clusters are zn, rad & nox.

Next, we use predict() to classify the LDA-transformed testing data. Finally, we calculate the accuracy of the LDA model by comparing the predicted classes with the true classes.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      15        1    0
##   med_low    3      14        3    0
##   med_high   0       7       16    1
##   high       0       0        0   31

Let’s calculate the accuracy of the prediction: (11+14+16+31)/102 * 100% = ~71% of the predictions are correct (~29% of observations are misclassified).
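The same accuracy can be computed directly from the cross-tabulation with diag(); a sketch rebuilding the table printed above:

```r
# Sketch: accuracy as the share of the diagonal of the confusion
# matrix printed above.
conf <- matrix(c(11, 15,  1,  0,
                  3, 14,  3,  0,
                  0,  7, 16,  1,
                  0,  0,  0, 31),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.706
```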

Clustering

Cluster analysis is a generic term for a wide range of numerical methods with the common goal of uncovering or discovering groups or clusters of observations that are homogeneous and separated from other groups. Clusters are identified by the assessment of the relative distances between points. Clustering means that some points (or observations) of the data are in some sense closer to each other than some other points.

Clustering starts by calculating an inter-individual distance or similarity matrix. There are many ways to calculate distances or similarities between pairs of individuals; here we use the Euclidean distance. After calculating the distances, we proceed to run the k-means algorithm.

K-means clustering is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects.
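For two observations, the Euclidean distance is simply the square root of the sum of squared coordinate differences. A sketch verifying this against dist() for the first two scaled observations:

```r
# Sketch: Euclidean distance between the first two scaled observations,
# computed by hand and with dist().
library(MASS)
data("Boston")
bs <- scale(Boston)
d_manual <- sqrt(sum((bs[1, ] - bs[2, ])^2))
d_dist   <- as.matrix(dist(bs))[1, 2]
all.equal(d_manual, d_dist)  # TRUE
```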

data("Boston") #load the data again

boston_scaled_2 <- scale(Boston) # scaling variables
boston_scaled_2<-as.data.frame(boston_scaled_2) # change the object to data frame

dist_eu <- dist(boston_scaled_2) # euclidean distance matrix

summary(dist_eu) # look at the summary of the distances
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering
km <- kmeans(boston_scaled_2, centers = 3) #centers = number of clusters

# plot the Boston dataset with clusters
pairs(boston_scaled_2, col = km$cluster) 

The summary table of the Euclidean distances shows the minimum, 1st quartile, median, mean, 3rd quartile and maximum of the distances between pairs of points.

Let’s determine the optimal number of clusters.

When plotting the number of clusters against the total within-cluster sum of squares (WCSS), the optimal number of clusters is where the total WCSS drops radically.

#K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled_2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The plot above represents the variance within the clusters. It decreases as k increases, but a bend can be seen at k = 6. This bend indicates that additional clusters beyond the sixth add little value. In the next section, we’ll classify the observations into 6 clusters. (more info: https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/)

km <- kmeans(boston_scaled_2, centers = 6) # k-means clustering

print(km)
## K-means clustering with 6 clusters of sizes 49, 47, 34, 78, 124, 174
## 
## Cluster means:
##         crim         zn      indus       chas        nox         rm         age
## 1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554  1.5399833  0.07933756
## 2 -0.2834643 -0.4872402  1.5965043 -0.2723291  1.0453673 -0.6245590  0.94375129
## 3 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.2756681  0.37213224
## 4 -0.4126954  1.9952361 -1.1032525 -0.2218534 -1.1552982  0.6080657 -1.40363754
## 5  1.1156495 -0.4872402  1.0149946 -0.2723291  0.9916473 -0.4276828  0.75159525
## 6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
##          dis          rad         tax    ptratio       black      lstat
## 1 -0.2868574 -0.520140581 -0.82627237 -1.0314738  0.35523433 -0.9636997
## 2 -0.8963217 -0.622669809  0.03772813 -0.2084479 -0.08717824  0.7185475
## 3 -0.4033382  0.001081444 -0.09756330 -0.3924585  0.17154271 -0.1643525
## 4  1.5772490 -0.627024335 -0.57588304 -0.7161406  0.35335146 -0.9028481
## 5 -0.8170870  1.659602895  1.52941294  0.8057784 -0.81154619  0.9129958
## 6  0.2769540 -0.587168164 -0.59021294  0.1702603  0.30993538 -0.1365045
##         medv
## 1  1.6147330
## 2 -0.5152915
## 3  0.5733409
## 4  0.6621944
## 5 -0.7713403
## 6 -0.1747229
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   6   1   1   1   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   4 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   4   6   6   6   6   6   6   6   6   6   6   6   4   4   4   4   4   4   4   6 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   6   6   6   4   4   4   4   6   6   6   6   6   6   6   6   6   6   6   6   6 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   4   6   6   6   6   6   6   6   1   1   6   6   6   6   6   6   6   1   1   1 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   2   2   3   2   2   2   2   2   2   2   2   2   3   2   3   3   2   1   2   2 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   3   1   3   3   2   2   1   2   2   2   2   2   6   6   6   1   6   6   1   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   1   1   1   1   6   6   1   4   4   4   4   4   4   4   4   4   4   4   4   4 
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
##   4   4   4   4   4   6   6   6   3   3   3   3   3   6   6   6   3   1   3   3 
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   3   3   3   1   1   1   1   1   1   1   6   1   1   1   3   6   3   1   4   4 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##   4   6   4   4   6   6   4   6   6   4   4   4   4   4   4   4   4   1   1   1 
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
##   1   1   1   1   1   6   1   1   1   3   6   6   6   3   3   4   3   3   4   1 
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   1   1   3   4   4   4   4   4   4   4   4   4   4   6   6   6   6   6   4   4 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
##   4   4   4   4   1   6   1   1   6   6   6   6   6   6   6   6   6   6   6   6 
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 
##   6   6   6   6   6   6   6   6   6   6   6   4   4   6   6   6   6   6   6   6 
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   6   4   6   4   4   6   6   4   4   4   4   4   4   4   4   4   3   3   3   5 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 
##   5   5   5   3   3   5   5   5   5   3   3   5   3   5   5   5   5   5   5   5 
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 
##   5   5   5   5   5   5   5   5   2   2   2   2   2   6   6   6   6   6   6   6 
## 501 502 503 504 505 506 
##   6   6   6   6   6   6 
## 
## Within cluster sum of squares by cluster:
## [1] 209.5683 263.8282 340.7321 363.4559 984.9444 590.5735
##  (between_SS / total_SS =  61.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
  • The k-means printed output displays:
    • sizes of the clusters
    • the cluster means or centers: a matrix whose rows are the clusters (1 to 6) and whose columns are the variables
    • the clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated

Let’s visualize the clusters with the pairs() function:

pairs(boston_scaled_2, col = km$cluster)

  • Here are some observations from the clustering visualization:
    • better accessibility to radial highways and a higher property-tax rate seem to cluster together with a high crime rate
    • a lower median value of owner-occupied homes falls into the same clusters as increasing crime rates
    • the industry and nitrogen oxide variables form at least 3 clear clusters together
    • low status of the population is clustered with a reduced median value of owner-occupied homes
Bonus task:
data("Boston")
boston_scaled_bonus <- scale(Boston) # scaling
boston_scaled_bonus <- as.data.frame(boston_scaled_bonus) # change the object to a data frame

#clusters: km$cluster

#LDA
set.seed(123)
lda.fit_bonus <- lda(km$cluster~., data = boston_scaled_bonus)

lda.fit_bonus # print the lda.fit object
## Call:
## lda(km$cluster ~ ., data = boston_scaled_bonus)
## 
## Prior probabilities of groups:
##          1          2          3          4          5          6 
## 0.09683794 0.09288538 0.06719368 0.15415020 0.24505929 0.34387352 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm         age
## 1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554  1.5399833  0.07933756
## 2 -0.2834643 -0.4872402  1.5965043 -0.2723291  1.0453673 -0.6245590  0.94375129
## 3 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.2756681  0.37213224
## 4 -0.4126954  1.9952361 -1.1032525 -0.2218534 -1.1552982  0.6080657 -1.40363754
## 5  1.1156495 -0.4872402  1.0149946 -0.2723291  0.9916473 -0.4276828  0.75159525
## 6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
##          dis          rad         tax    ptratio       black      lstat
## 1 -0.2868574 -0.520140581 -0.82627237 -1.0314738  0.35523433 -0.9636997
## 2 -0.8963217 -0.622669809  0.03772813 -0.2084479 -0.08717824  0.7185475
## 3 -0.4033382  0.001081444 -0.09756330 -0.3924585  0.17154271 -0.1643525
## 4  1.5772490 -0.627024335 -0.57588304 -0.7161406  0.35335146 -0.9028481
## 5 -0.8170870  1.659602895  1.52941294  0.8057784 -0.81154619  0.9129958
## 6  0.2769540 -0.587168164 -0.59021294  0.1702603  0.30993538 -0.1365045
##         medv
## 1  1.6147330
## 2 -0.5152915
## 3  0.5733409
## 4  0.6621944
## 5 -0.7713403
## 6 -0.1747229
## 
## Coefficients of linear discriminants:
##                   LD1         LD2          LD3         LD4         LD5
## crim    -0.0305347753 -0.08591377 -0.108258758  0.01370669  0.12793637
## zn      -0.2349745638 -0.08924451 -0.895982678 -1.47547778  0.34969495
## indus    0.1256678211 -0.64511317  1.388877801 -1.77654417  0.39972164
## chas     5.8325152872  0.24741612 -0.311207345 -0.07443844 -0.27828298
## nox     -0.0025874888  0.27316108  0.279998181 -0.33251185  0.43887081
## rm      -0.0806099076 -0.01151686 -0.148388969  0.12537565  0.56954956
## age      0.0721570288 -0.01868907  0.362844031  0.22642315  0.24959116
## dis      0.0698432704  0.27583463 -0.218044854 -0.54697937 -0.14293536
## rad      0.2608126124 -3.31540916 -1.507829403  1.14007512  0.04287448
## tax     -0.0049952119  0.23295413 -0.381672532 -0.48815038 -0.21073890
## ptratio -0.0007726087  0.08817263  0.041915486 -0.04862072 -0.32527265
## black    0.0078197790  0.09331354 -0.003179723  0.02328372 -0.03306096
## lstat   -0.1245078745 -0.13037037 -0.008198020 -0.03449407  0.25315537
## medv    -0.1868981610  0.34078428  0.026517982  0.20032076  0.88234880
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5 
## 0.6255 0.2484 0.0727 0.0387 0.0147
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
                   x1 = myscale * heads[,choices[1]], 
                   y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes_bonus <- as.numeric(km$cluster)

# plot the lda results
plot(lda.fit_bonus, dimen = 2, col=classes_bonus, pch = classes_bonus)
lda.arrows(lda.fit_bonus, myscale = 1.2)

The most influential linear separators for the clusters are chas, rad & indus.
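One way to back up this kind of claim is to rank the variables by the absolute size of their discriminant coefficients. A sketch using MASS::lda on the built-in iris data (a stand-in for the bonus model above):

```r
library(MASS)  # for lda()

# rank predictors by |LD1| coefficient: a larger magnitude means a
# stronger contribution to the first linear discriminant
fit <- lda(Species ~ ., data = iris)
sort(abs(coef(fit)[, "LD1"]), decreasing = TRUE)
```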

Super-Bonus:
model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)

The 3D plot helps with the separation of clusters that are overlapping on two axes.
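The projection behind the plot is plain matrix multiplication: an (n x p) predictor matrix times the (p x d) scaling matrix gives each observation d discriminant coordinates. A minimal sketch with made-up dimensions (not the course data):

```r
set.seed(42)
X <- matrix(rnorm(20 * 4), nrow = 20)  # 20 observations, 4 predictors
W <- matrix(rnorm(4 * 3),  nrow = 4)   # 4 predictors, 3 discriminant axes
proj <- X %*% W                        # same idea as model_predictors %*% lda.fit$scaling
dim(proj)                              # 20 x 3
```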

###FIN###


Assignment 5: Dimensionality reduction techniques

Name: Henna Kallo

Date: 27.11.2023

In this exercise we learn about two dimensionality reduction techniques: principal component analysis (PCA) and multiple correspondence analysis (MCA).

In this exercise we will study The Human Development Index (HDI) dataset, which was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone.

library(readr); library(dplyr); library(tibble); library(GGally); library(corrplot); library(ggplot2); library(factoextra); library(FactoMineR); library(tidyr); # load libraries

rm(list=ls()) # clear variables

getwd()
setwd("C:/Users/Henna/Desktop/IODS/IODS-project/Data")  # set working directory

human <- read_csv("C:/Users/Henna/Desktop/IODS/IODS-project/Data/human.csv") # load the data

str(human)
  • The data set consists of 155 observations and 9 variables.

  • Variables of the dataset:

    • Country = Country name
    • Edu2.FM = Edu2.F / Edu2.M: ratio of female and male populations with secondary education in each country
    • Labo.FM = Labo.F / Labo.M: ratio of labor force participation of females and males in each country
    • Life.Exp = Life expectancy at birth
    • Edu.Exp = Expected years of schooling
    • GNI = Gross National Income per capita
    • Mat.Mor = Maternal mortality ratio
    • Ado.Birth = Adolescent birth rate
    • Parli.F = Percentage of female representatives in parliament

Next we will calculate summaries of the variables…

# move the country names to rownames
human_ <- column_to_rownames(human, "Country")

# summaries of the variables
summary(human_)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
  • Summary
    • The mean ratio of female to male populations with secondary education is 0.85, showing that more men than women have secondary education. The 3rd quartile (~1.0) reveals that in about 75% of the countries a larger proportion of men than women have secondary education.
    • The labor force participation rate is also higher for men than for women; there are few countries where it is higher for women (max = 1.04).
    • Mean life expectancy at birth is about 72 years, ranging from 49 up to 83.50 between the countries.
    • The highest expected years of schooling is 20.20 years, the minimum 5.40, and the mean 13.18.
    • The mean Gross National Income per capita is 17628, but there is large variation between the countries: the minimum is 581 and the maximum 123124.
    • The maternal mortality ratio also varies massively between the countries, as does the adolescent birth rate.
    • The percentage of female representatives in parliament varies from 0% up to almost 60%.

Overall, this summary indicates that there is large variability between the countries studied.

… and explore the data and the relations between the variables graphically.

ggpairs(human_, progress=FALSE) # visualize the 'human_' variables

cor(human_) %>% # compute the correlation matrix and visualize it with corrplot
  corrplot()

  • Variable distributions
    • Many of the variables are skewed. For instance, Gross National Income per capita is right-skewed: most of its values are small, with a long tail of large values. Expected years of schooling most closely resembles a normal distribution.
  • Associations between the variables
    • The scatterplots reveal positive and negative associations between the variables in several cases. For example, there is a positive association between Maternal mortality ratio and Adolescent birth rate, and between Life expectancy at birth and Expected years of schooling, to mention a few. The correlation coefficients show an association in most of the comparisons (a statistically significant association is marked with a star). The corrplot is a nice way to visualize the correlations: the larger the circle and the darker its color, the stronger the association, with red indicating a negative and blue a positive association. For example, there is a strong negative correlation between Life expectancy at birth and Maternal mortality ratio (r = -0.857): the higher the life expectancy, the lower the maternal mortality ratio. These associations make sense, as the variables are commonly regarded as indicators of a welfare society and are surely linked together.
  • As there are many explanatory variables relative to the number of observations, and the explanatory variables are highly correlated, we have a suitable dataset for upcoming principal component analysis.
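The "highly correlated" claim is easy to quantify. A base-R sketch that locates the strongest pairwise correlation in a numeric data frame (shown here with the built-in mtcars data as a stand-in; on human_ it would flag the Life.Exp vs Mat.Mor pair noted above):

```r
# find the variable pair with the largest absolute correlation
cm <- cor(mtcars)
cm[upper.tri(cm, diag = TRUE)] <- NA       # keep each pair only once
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
c(rownames(cm)[idx[1, 1]], colnames(cm)[idx[1, 2]], round(cm[idx], 3))
```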

Principal component analysis (PCA)

Perform principal component analysis (PCA) on the raw (non-standardized) human data, explore the variability and visualize it with a biplot, with arrows representing the original variables.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.5, 1), col = c("grey40", "deeppink2"))

Gross National Income per capita seems to dominate the first principal component by far. However, we remember from the summary inspection that the variance of GNI was massively larger than that of any other variable, which is likely to hide the effects of the other variables.
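This scale effect is easy to demonstrate: in an unstandardized PCA, the variable with the largest variance dominates PC1. A sketch with the built-in mtcars data, where disp plays the role GNI plays here:

```r
# the variable with by far the largest raw variance
sort(apply(mtcars, 2, var), decreasing = TRUE)[1:3]

pca_raw <- prcomp(mtcars)          # PCA without scaling
summary(pca_raw)$importance[2, 1]  # proportion of variance captured by PC1 alone
```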

Let’s proceed with standardizing the variables in the human data and repeat the above analysis.

human_std <- scale(human_) # standardize the variables

summary(human_std) # print out summaries of the standardized variables
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human <- prcomp(human_std) # perform principal component analysis (with the SVD method)

s <- summary(pca_human) # create and print out a summary of pca_human

pca_pr <- round(s$importance[2, ], digits = 3) * 100 # rounded percentages of variance captured by each PC

print(pca_pr) # print out the percentages of variance
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)") # create object pc_lab to be used as axis labels

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2],
       xlim=c(-.4, .4),
       main='PCA - Results 1',
       sub='Principal components 1 and 2 explain 69.8% of the total variance between the countries.',
       expand=1.2)

#fviz_cos2(pca_human, choice = "var", axes = 1:2) #The goal of the visualization is to see how much each variable is represented in a given component. Such a quality of representation is called the Cos2 and corresponds to the square cosine, and it is computed using the fviz_cos2 function.

fviz_pca_var(pca_human, col.var = "cos2",
            gradient.cols = c("black", "orange", "green"),
            repel = TRUE,
            title='PCA - Results 2')

#http://agroninfotech.blogspot.com/2020/06/biplot-for-pcs-using-base-graphic.html#add-axis-title-and-labels
  • Summaries of the standardized variables reveal that now the ranges of values are on the same scale and the means are centered to zero. Standardization ensures that each variable has the same level of contribution, preventing one variable from dominating the others. The results and interpretations are very different between the standardized and non-standardized data sets. Always carefully inspect the data before drawing final conclusions!

  • Next we have printed out the percentages of variance captured by each principal component. The first principal component explains ~53.6% of the total variance and the second 16.2%; together they cover 69.8% of the total variance. (https://www.datacamp.com/tutorial/pca-analysis-r).
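The standardization step itself can be verified directly: scale() centers each column to mean 0 and unit standard deviation. A quick base-R check, with mtcars standing in for the human data:

```r
z <- scale(mtcars)
max(abs(colMeans(z)))   # ~0: every column is centered
range(apply(z, 2, sd))  # every column has sd exactly 1
```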

With a biplot we visualize the similarities and dissimilarities between the samples, and the impact of each attribute on each of the principal components.

  • PCA - Results 1
    • First, all the variables that are grouped together are positively correlated with each other. ‘Percentage of female representatives in parliament’ and ‘ratio of labor force participation of females and males’ are close to each other, ‘Maternal mortality ratio’ and ‘Adolescent birth rate’ are close to each other, and the remaining four variables form a third group.
    • Variables that are negatively correlated appear on opposite sides of the biplot’s origin (such as ‘Maternal mortality ratio’ and ‘Life expectancy at birth’).
    • The greater the distance between a variable and the origin, the better represented that variable is. For instance, ‘Maternal mortality ratio’ is better represented than ‘Adolescent birth rate’.
    • Consider the Nordic countries (Finland, Norway, Iceland, Sweden, Denmark), grouped on the upper left side of the plot: these countries share high values in percentage of female representatives in parliament, ratio of labor force participation of females and males, expected years of schooling, life expectancy at birth, Gross National Income per capita and ratio of female and male populations with secondary education, together with low ‘Maternal mortality ratio’ and ‘Adolescent birth rate’. This pattern is what one would expect of welfare societies.
  • PCA - Results 2: Contribution of each variable
    • From the illustration Cos2 of variables to Dim1-2, ‘Maternal mortality ratio’, ‘Life expectancy at birth’, ‘Expected years of schooling’ & ‘Adolescent birth rate’ are the top four variables with the highest cos2, hence contributing the most to PC1 and PC2.
    • High cos2 attributes are colored in green: ‘Maternal mortality ratio’ & ‘Life expectancy at birth’. Mid cos2 attributes have an orange color: ‘Adolescent birth rate’ & ‘ratio of labor force participation of females and males in each country’. Finally, low cos2 attributes have a black color.
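The cos2 values in the plot can also be computed by hand: for standardized data, a variable's cos2 on a component is the squared correlation between the variable and that component, i.e. (loading × component sd)². A base-R sketch, with mtcars again as stand-in data:

```r
pca <- prcomp(scale(mtcars))
# squared correlations between variables and components
cos2 <- (pca$rotation %*% diag(pca$sdev))^2
rownames(cos2) <- rownames(pca$rotation)
round(rowSums(cos2[, 1:2]), 2)  # quality of representation on Dim 1-2, as in fviz_cos2
```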

Questionnaire on tea: MCA analysis

In this study, 300 individuals were asked how they drink tea (18 questions) and about their perception of the product (12 questions). In addition, 4 questions on personal details were asked.

In this task we practice the MCA dimensionality reduction technique. The approach is similar to PCA, but adjusted for categorical (discrete) variables.

Load the tea dataset and convert its character variables to factors. Explore the data briefly:

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

#View(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

The tea data consists of 300 observations and 36 variables. All of the variables except age are factors with 2-7 levels; age is an integer.

  • Let’s select some variables to explore them more:
    • Tea: 3 levels: “black”,“Earl Grey”, “green”
    • How: 4 levels “alone”,“lemon”, “milk”, “other”
    • how: 3 levels: “tea bag”, “tea bag+unpackaged”, “unpackaged”
    • sugar: 2 levels: “sugar”, “No.sugar”
    • where: 3 levels: “chain store”, “chain store+tea shop”, “tea shop”
    • spirituality: 2 levels: “Not.spirituality”, “spirituality”
    • healthy: 2 levels: “healthy”,“Not.healthy”
    • frequency: 4 levels: “+2/day”,“1 to 2/week”, “1/day”, “3 to 6/week”

Next, we will visualize the variables:

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "spirituality", "healthy", "frequency")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, all_of(keep_columns))

summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where               spirituality        healthy   
##  chain store         :192   Not.spirituality:206   healthy    :210  
##  chain store+tea shop: 78   spirituality    : 94   Not.healthy: 90  
##  tea shop            : 30                                           
##                                                                     
##        frequency  
##  +2/day     :127  
##  1 to 2/week: 44  
##  1/day      : 95  
##  3 to 6/week: 34
# visualize the dataset
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free")+
  geom_bar()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

  • Frequency: most of the people have tea at least twice per day.
  • About 2/3 consider tea healthy.
  • Most of the respondents use tea bags instead of unpackaged tea.
  • Most of the individuals drink their tea plain, without adding lemon, milk or anything else.
  • About 1/3 of the individuals attach some kind of spiritual aspect to their tea time.
  • There is no major difference in the numbers of individuals who have their tea with or without sugar.
  • Most of the respondents drink Earl Grey, with black tea in second place.
  • The majority buy their tea from a chain store.

Next we will use Multiple Correspondence Analysis (MCA) on the tea data of selected columns. With this analysis we aim to detect groups of individuals with similar profile in their answers to the questions, and associations between variable categories. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.221   0.205   0.177   0.153   0.144   0.136   0.127
## % of var.             11.775  10.932   9.462   8.173   7.677   7.279   6.760
## Cumulative % of var.  11.775  22.707  32.169  40.342  48.019  55.298  62.058
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.116   0.107   0.105   0.101   0.091   0.082   0.063
## % of var.              6.207   5.717   5.616   5.400   4.857   4.348   3.336
## Cumulative % of var.  68.265  73.982  79.598  84.998  89.856  94.204  97.540
##                       Dim.15
## Variance               0.046
## % of var.              2.460
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.390  0.229  0.134 |  0.057  0.005  0.003 | -0.417
## 2                  | -0.248  0.093  0.041 | -0.091  0.013  0.005 | -0.786
## 3                  | -0.184  0.051  0.048 | -0.058  0.005  0.005 | -0.398
## 4                  | -0.570  0.490  0.314 | -0.153  0.038  0.023 |  0.209
## 5                  | -0.238  0.085  0.049 | -0.056  0.005  0.003 |  0.087
## 6                  | -0.409  0.252  0.209 | -0.033  0.002  0.001 | -0.359
## 7                  | -0.211  0.067  0.030 |  0.049  0.004  0.002 | -0.105
## 8                  | -0.346  0.181  0.061 |  0.129  0.027  0.008 | -0.635
## 9                  |  0.499  0.377  0.138 | -0.520  0.440  0.150 |  0.108
## 10                 |  0.814  1.000  0.449 | -0.389  0.245  0.102 | -0.467
##                       ctr   cos2  
## 1                   0.326  0.154 |
## 2                   1.162  0.408 |
## 3                   0.298  0.227 |
## 4                   0.082  0.042 |
## 5                   0.014  0.007 |
## 6                   0.242  0.161 |
## 7                   0.021  0.007 |
## 8                   0.757  0.204 |
## 9                   0.022  0.007 |
## 10                  0.410  0.148 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.542   4.099   0.096   5.360 |   0.145   0.315   0.007
## Earl Grey          |  -0.212   1.634   0.081  -4.918 |  -0.200   1.565   0.072
## green              |   0.024   0.004   0.000   0.144 |   0.844   4.776   0.088
## alone              |  -0.080   0.235   0.012  -1.885 |   0.188   1.401   0.066
## lemon              |   0.568   2.007   0.040   3.451 |   0.035   0.008   0.000
## milk               |  -0.230   0.631   0.014  -2.055 |  -0.367   1.727   0.036
## other              |   1.265   2.717   0.049   3.846 |  -1.633   4.878   0.082
## tea bag            |  -0.653  13.661   0.557 -12.903 |  -0.039   0.052   0.002
## tea bag+unpackaged |   0.737   9.642   0.248   8.611 |  -0.658   8.271   0.198
## unpackaged         |   1.156   9.086   0.182   7.384 |   1.901  26.449   0.493
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                1.431 |  -0.858  12.784   0.241  -8.486 |
## Earl Grey           -4.638 |   0.427   8.274   0.329   9.922 |
## green                5.129 |  -0.575   2.566   0.041  -3.498 |
## alone                4.431 |  -0.085   0.335   0.014  -2.014 |
## lemon                0.214 |   1.314  13.391   0.214   7.990 |
## milk                -3.274 |  -0.242   0.864   0.016  -2.154 |
## other               -4.965 |  -1.277   3.445   0.050  -3.882 |
## tea bag             -0.767 |  -0.186   1.388   0.045  -3.687 |
## tea bag+unpackaged  -7.685 |   0.239   1.260   0.026   2.791 |
## unpackaged          12.139 |   0.257   0.557   0.009   1.638 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.101 0.109 0.335 |
## How                | 0.099 0.131 0.256 |
## how                | 0.572 0.570 0.045 |
## sugar              | 0.116 0.000 0.296 |
## where              | 0.626 0.660 0.087 |
## spirituality       | 0.001 0.038 0.146 |
## healthy            | 0.016 0.038 0.138 |
## frequency          | 0.234 0.094 0.115 |
#To visualize the percentages of inertia explained by each MCA dimensions
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 12), title="Percentages")

# visualize MCA with the biplot of individuals and variable categories:
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali", title="MCA Biplot")

fviz_mca_var(mca, col.var = "cos2",
            gradient.cols = c("black", "orange", "green"),
            repel = TRUE,
            title='MCA - Results')

fviz_ellipses(mca, c("where", "how"),
              geom = "point")

The summary output reveals that Dim 1 explains 11.775% and Dim 2 10.932% of the variance. The tables also show which of the categorical variables contribute the most to the dimensions. The “Percentages” plot visualizes this.
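A small sanity check on the eigenvalue table: in indicator-matrix MCA, the number of dimensions equals the total number of categories minus the number of variables. For the eight tea_time variables this gives exactly the 15 dimensions seen in the summary (category counts copied from the summary above):

```r
# levels per selected variable
categories <- c(Tea = 3, How = 4, how = 3, sugar = 2,
                where = 3, spirituality = 2, healthy = 2, frequency = 4)
sum(categories) - length(categories)  # 15 dimensions
```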

  • The biplot interpretations:
    • The distance between points gives a measure of their similarity (or dissimilarity): points with similar profiles are close to each other on the factor map. The plot also helps to identify the variables most correlated with each dimension.
    • Tea shop and unpackaged tea are correlated on both dimensions. Tea bag+unpackaged is correlated with chain store+tea shop on dimension 1.

The last two plots again highlight how these two variables, “where” and “how”, contribute to the analysis.

###FIN###